Appendix of Synergy-of-Experts: 1 Theoretical Proofs
From Figure 1(a), learning multiple linear sub-models and averaging their predictions (ensemble) is still a linear model, so it cannot tackle the XOR problem. We compare the training cost of all methods from two aspects: 1) the sub-model training ensures that most adversarial attacks on the sub-models can be successfully defended. In particular, we train two kinds of models to defend against the attacks. From Figures 2(a) and 2(b), when 0.01 ≤ ε ≤ 0.04, SoE without the collaboration training achieves robustness similar to that of SoE.
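To make the averaging argument concrete, here is a minimal NumPy sketch (not code from the Synergy-of-Experts paper) showing that the average of any set of linear sub-models collapses to a single linear model with the averaged parameters, and such a model cannot separate the XOR labels.

```python
import numpy as np

# XOR inputs and labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Two arbitrary linear sub-models f_i(x) = w_i @ x + b_i
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))   # one weight row per sub-model
b = rng.normal(size=2)

# Ensemble by averaging the sub-model predictions
avg_pred = (X @ W.T + b).mean(axis=1)

# The average is itself a single linear model with averaged parameters
w_avg, b_avg = W.mean(axis=0), b.mean()
assert np.allclose(avg_pred, X @ w_avg + b_avg)

# No linear decision rule labels XOR correctly: for any linear f,
# f(0,0) + f(1,1) == f(0,1) + f(1,0), so the two classes cannot both
# lie on opposite sides of a single threshold.
print((avg_pred > avg_pred.mean()).astype(int), "vs labels", y)
```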
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > Texas > Travis County > Austin (0.05)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)
FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training
Recent breakthroughs in deep neural networks (DNNs) have motivated an explosive demand for intelligent edge devices. Many of them, such as autonomous vehicles and healthcare wearables, require real-time and on-site learning to enable them to proactively learn from new data and adapt to dynamic environments.
D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general corpus (e.g., Dolma, Slim-pajama) and the downstream domain corpus. Existing methods usually rely on laborious human effort, grid-searching over a set of mixture ratios, which incurs high GPU training costs. Besides, there is no guarantee that the selected ratio is optimal for the specific domain. To address these limitations, inspired by the Scaling Law for performance prediction, we propose to investigate the Scaling Law of Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training runs on limited experiments. Moreover, we also extend our standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT Law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law.
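The abstract does not give the exact parameterization of the D-CPT Law, so the sketch below only illustrates the fit-then-predict workflow it describes, assuming a hypothetical Chinchilla-style loss form with an added mixture-ratio term; the function name, the functional form, and all data points are placeholders, not the paper's.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical parametric form (NOT the paper's exact D-CPT Law): a
# Chinchilla-style term in model size N and data size D plus a
# mixture-ratio term in r, purely to illustrate fitting and prediction.
def domain_loss(X, E, A, alpha, B, beta, C, gamma):
    N, D, r = X   # N: params (billions), D: tokens (billions), r: domain-corpus ratio
    return E + A / N**alpha + B / D**beta + C * (1.0 - r)**gamma

# Small-scale pilot runs (synthetic placeholder numbers, not paper results).
N = np.array([0.1, 0.1, 0.1, 1.0, 1.0, 1.0, 7.0, 7.0])
D = np.array([1.0, 5.0, 20.0, 1.0, 5.0, 20.0, 5.0, 20.0])
r = np.array([0.1, 0.3, 0.5, 0.1, 0.3, 0.5, 0.3, 0.5])
loss = np.array([3.9, 3.5, 3.2, 3.4, 3.0, 2.7, 2.6, 2.3])

popt, _ = curve_fit(domain_loss, (N, D, r), loss,
                    p0=[2.0, 0.5, 0.3, 0.5, 0.3, 1.0, 1.0], maxfev=50000)

# Once fitted, predicting downstream loss at an unseen mixture ratio is a
# function evaluation, so the ratio for a target model/data budget can be
# chosen by a cheap grid search instead of retraining at every candidate.
ratios = np.linspace(0.05, 0.95, 19)
pred = domain_loss((np.full_like(ratios, 13.0), np.full_like(ratios, 100.0), ratios), *popt)
print("best ratio under this toy fit:", ratios[np.argmin(pred)])
```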
Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training
Recently, sparse training has emerged as a promising paradigm for efficient deep learning on edge devices. Current research mainly devotes its efforts to reducing training costs by further increasing model sparsity. However, increasing sparsity is not always ideal, since it inevitably introduces severe accuracy degradation at extremely high sparsity levels. This paper explores other possible directions to effectively and efficiently reduce sparse training costs while preserving accuracy. To this end, we investigate two techniques, namely, layer freezing and data sieving. First, the layer freezing approach has shown its success in dense model training and fine-tuning, yet it has never been adopted in the sparse training domain.
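As a concrete illustration of the layer-freezing idea (a sketch of common practice, not the framework proposed in this paper), the following PyTorch snippet disables gradient updates for the earliest layers partway through training, so their backward passes and weight updates are skipped; the model, schedule, and function name are illustrative assumptions.

```python
import torch.nn as nn

def freeze_early_layers(model: nn.Sequential, num_frozen: int) -> None:
    """Disable gradients for the first `num_frozen` layers so their weights
    (and any sparsity masks on them) stop being updated, cutting backward cost."""
    for layer in list(model.children())[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False
        layer.eval()  # also freeze batch-norm running statistics, if any

# Toy model and a simple progressive-freezing schedule (illustrative only).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(10),
)
for epoch in range(30):
    if epoch == 10:
        freeze_early_layers(model, 2)   # freeze the first conv + ReLU
    if epoch == 20:
        freeze_early_layers(model, 4)   # freeze both conv blocks
    # ... run the usual (sparse) training step on the remaining layers ...
```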